
This is a Python Jupyter notebook that can be run in a Colab environment. This notebook is designed to be used as a demonstration of plotly interactive visualizations in MBA 562.
You can follow through by running the code cells sequentially.
Analyze the dataset and answer the following questions:
▶️ Run the code cell below to import packages used in the case.
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
# plotly.io is a low-level interface for interacting with figures/
# plotly.io.templates lists available plotly templates
# https://plotly.com/python-api-reference/plotly.io.html
import plotly.io as pio
pd.set_option('display.max_columns', 50)
Run the code below to ensure that your notebook uses the same Plotly version as the autograder.
print(f'The current plotly version is {plotly.__version__}')
The current plotly version is 5.11.0
▶️ Run the code below to import and process the trips dataset.
# read the trips dataset and uncompress gzip
# parse the "start" column as datetime
df = pd.read_csv(
'https://github.com/bdi475/datasets/raw/main/case-studies/chicago-ridesharing/chicago-ridesharing-trips-2019-2020.csv.gz',
compression='gzip',
parse_dates=['start']
)
# read community areas information
df_community_areas = pd.read_csv('https://github.com/bdi475/datasets/raw/main/case-studies/chicago-ridesharing/chicago-community-area-numbers.csv')
# Replace community area numbers with area names
df = df.merge(df_community_areas, left_on='pickup_area', right_on='area_number', how='left')
df['pickup_area'] = df['community'].copy()
df.drop(columns=['community', 'area_number'], inplace=True)
df = df.merge(df_community_areas, left_on='dropoff_area', right_on='area_number', how='left')
df['dropoff_area'] = df['community'].copy()
df.drop(columns=['community', 'area_number'], inplace=True)
# similar to R tibble's head() function
df.head(5)
| start | trip_seconds | trip_miles | pickup_area | dropoff_area | fare | tip | additional_charges | trip_total | shared_trip_authorized | trips_pooled | pickup_lat | pickup_lon | dropoff_lat | dropoff_lon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 | 303 | 1.4 | Garfield Ridge | Clearing | 5.0 | 5.0 | 7.68 | 17.68 | False | 1 | 41.792592 | -87.769615 | 41.779583 | -87.768511 |
| 1 | 2019-01-01 | 697 | 3.0 | Near North Side | Near West Side | 7.5 | 0.0 | 2.50 | 10.00 | False | 1 | 41.892073 | -87.628874 | 41.885300 | -87.642808 |
| 2 | 2019-01-01 | 1598 | 4.7 | Lincoln Park | Loop | 10.0 | 2.0 | 2.50 | 14.50 | False | 1 | 41.922083 | -87.634156 | 41.870607 | -87.622173 |
| 3 | 2019-01-01 | 573 | 0.9 | Near North Side | Near North Side | 5.0 | 0.0 | 2.50 | 7.50 | False | 1 | 41.892042 | -87.631864 | 41.892508 | -87.626215 |
| 4 | 2019-01-01 | 1562 | 2.4 | Near North Side | Near North Side | 10.0 | 0.0 | 2.50 | 12.50 | False | 1 | 41.900221 | -87.629105 | 41.895033 | -87.619711 |
# df.info() is similar to R's str() function
# in Python, the str() function casts an object to a string type
# in R, the str() function displays the structure of an object
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1388193 entries, 0 to 1388192 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 start 1388193 non-null datetime64[ns] 1 trip_seconds 1388193 non-null int64 2 trip_miles 1388193 non-null float64 3 pickup_area 1388193 non-null object 4 dropoff_area 1388193 non-null object 5 fare 1388193 non-null float64 6 tip 1388193 non-null float64 7 additional_charges 1388193 non-null float64 8 trip_total 1388193 non-null float64 9 shared_trip_authorized 1388193 non-null bool 10 trips_pooled 1388193 non-null int64 11 pickup_lat 1388193 non-null float64 12 pickup_lon 1388193 non-null float64 13 dropoff_lat 1388193 non-null float64 14 dropoff_lon 1388193 non-null float64 dtypes: bool(1), datetime64[ns](1), float64(9), int64(2), object(2) memory usage: 160.2+ MB
# similar to R's row() and ncol() functions
# the shape property contains the number of rows and columns
num_rows = df.shape[0]
num_cols = df.shape[1]
# alternatively, use unpacking syntax to combine the two lines into one
# num_rows, num_cols = df.shape
print(f'There are {num_rows} rows and {num_cols} columns in the dataset.')
There are 1388193 rows and 15 columns in the dataset.
The start column contains trip start timestamps. We will extract year, month, day of the month, day of the week, hour, and weekday/weekend information into separate columns.
# the .dt accessor is similar to R lubridate's getter functions
# such as lubridate::month(my_date)
df['year'] = df['start'].dt.year
df['month'] = df['start'].dt.month
df['day'] = df['start'].dt.day
df['dayofweek'] = df['start'].dt.dayofweek
df['hour'] = df['start'].dt.hour
df_sample = df[df.index.isin([400000, 600000, 800000, 1000000, 1200000])]
display(df_sample[['start', 'year', 'month', 'day', 'dayofweek', 'hour']])
| start | year | month | day | dayofweek | hour | |
|---|---|---|---|---|---|---|
| 400000 | 2019-05-28 19:00:00 | 2019 | 5 | 28 | 1 | 19 |
| 600000 | 2019-08-12 19:15:00 | 2019 | 8 | 12 | 0 | 19 |
| 800000 | 2019-10-29 08:00:00 | 2019 | 10 | 29 | 1 | 8 |
| 1000000 | 2020-01-15 12:45:00 | 2020 | 1 | 15 | 2 | 12 |
| 1200000 | 2020-06-09 17:00:00 | 2020 | 6 | 9 | 1 | 17 |
weekday_weekend column¶weekday_weekend in df.# numpy's np.where() is similar to R's ifelse() function
df['weekday_weekend'] = np.where(df['dayofweek'] <= 3, 'weekday', 'weekend')
display(df[['dayofweek', 'weekday_weekend']].sample(5))
| dayofweek | weekday_weekend | |
|---|---|---|
| 209635 | 1 | weekday |
| 444031 | 4 | weekend |
| 1060627 | 5 | weekend |
| 224610 | 6 | weekend |
| 28169 | 5 | weekend |
Although the first case of COVID-19 was reported in January 2020 in the United States, people started to take it seriously in March 2020.
How did COVID-19 affect the volume of ridesharing trips? 💨 Let's visualize and compare the monthly number of trips for both 2019 and 2020.
num_2019_trips = (df['year'] == 2019).sum()
num_2020_trips = (df['year'] == 2020).sum()
print(f'There were {num_2019_trips} trips in 2019.')
print(f'There were {num_2020_trips} trips in 2020.')
print(f'The number of trips decreased by {(num_2019_trips - num_2020_trips) / num_2019_trips * 100:.1f}%.')
There were 966346 trips in 2019. There were 421847 trips in 2020. The number of trips decreased by 56.3%.
Calculate the number of monthly trips and average tip percentages.
df_monthly = df.groupby(['year', 'month'], as_index=False).agg({
'start': 'count',
'tip': 'sum',
'fare': 'sum',
}).rename(columns={
'start': 'num_trips'
})
# Calculate average tip %
df_monthly['tip_pct'] = df_monthly['tip'] / df_monthly['fare']
# Remove tip and fare columns since they're not needed anymore
df_monthly.drop(columns=['tip', 'fare'], inplace=True)
display(df_monthly.head(5))
| year | month | num_trips | tip_pct | |
|---|---|---|---|---|
| 0 | 2019 | 1 | 77313 | 0.048254 |
| 1 | 2019 | 2 | 77374 | 0.045353 |
| 2 | 2019 | 3 | 89688 | 0.047561 |
| 3 | 2019 | 4 | 79829 | 0.047786 |
| 4 | 2019 | 5 | 84168 | 0.049438 |
Create two bar charts side-by-side within a same figure showing the number of monthly trips for 2019 and 2020.
fig = px.bar(
df_monthly,
title='Monthly Number of Trips in 2019 and 2020',
x='month',
y='num_trips',
facet_col='year',
width=1000,
height=500,
template='plotly_dark',
color='num_trips',
color_continuous_scale=['White', 'Yellow']
)
fig.show()
Create a line plot of monthly number of trips. Use year to create two separate lines (encoded by different colors) for 2019 and 2020.
fig = px.line(
df_monthly,
title='Monthly Number of Trips Comparison between 2019 and 2020',
x='month',
y='num_trips',
color='year',
template='plotly_dark',
width=800,
height=500
)
fig.show()
Did passengers tip more on average during the pandemic since they appreciated the drivers providing services during risky times?
Or did the passengers tip less on average since the pandemic has devastated the nation's economy in 2020?
Create two bar charts side-by-side within a same figure showing the monthly average tip percentages in 2019 and 2020.
fig = px.bar(
df_monthly,
title='Monthly Average Tip % Based on Fares in 2019 and 2020',
x='month',
y='tip_pct',
facet_col='year',
width=1000,
height=500,
template='plotly_dark',
color='tip_pct',
color_continuous_scale=['White', 'GreenYellow']
)
fig.update_layout(yaxis_tickformat='%')
fig.update_layout(yaxis2_tickformat='%')
fig.show()
Create a line plot of monthly average tip percentages. Use year to create two separate lines (encoded by different colors) for 2019 and 2020.
fig = px.line(
df_monthly,
title='Monthly Average Tip % Comparison between 2019 and 2020',
x='month',
y='tip_pct',
color='year',
template='plotly_dark',
width=800,
height=500
)
fig.update_layout(yaxis_tickformat='%')
fig.show()
Every July 4th, massive crowds gather around places (or boats) to view the fireworks. How did July 4th become the "national fireworks day"? July 4, 1776 is considered to be the birth of United States of Amercia as an independent nation. The Continental Congress approved the final wording of the Declaration of Independence on July 4, 1776. The first-ever recorded July 4th fireworks celebration was held in Philadelphia on July 4, 1777. Since then, there hasn't been an Independence Day without a firework. 💥💥
In this part, you will find trips started on July 4th between 5-6 PM and create different scatter plots based on those trips.
df_july_fourth = df[(df['month'] == 7) & (df['day'] == 4) & (df['hour'] == 17)]
display(df_july_fourth.sample(3))
| start | trip_seconds | trip_miles | pickup_area | dropoff_area | fare | tip | additional_charges | trip_total | shared_trip_authorized | trips_pooled | pickup_lat | pickup_lon | dropoff_lat | dropoff_lon | year | month | day | dayofweek | hour | weekday_weekend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 497058 | 2019-07-04 17:45:00 | 669 | 3.0 | Belmont Cragin | Belmont Cragin | 2.5 | 0.0 | 2.55 | 5.05 | True | 2 | 41.927261 | -87.765502 | 41.927261 | -87.765502 | 2019 | 7 | 4 | 3 | 17 | weekday |
| 497004 | 2019-07-04 17:15:00 | 822 | 4.1 | Near West Side | Near North Side | 10.0 | 0.0 | 2.55 | 12.55 | False | 1 | 41.874177 | -87.661861 | 41.890922 | -87.618868 | 2019 | 7 | 4 | 3 | 17 | weekday |
| 497079 | 2019-07-04 17:45:00 | 729 | 2.4 | Lake View | Near North Side | 7.5 | 0.0 | 2.55 | 10.05 | False | 1 | 41.934659 | -87.646730 | 41.900266 | -87.632109 | 2019 | 7 | 4 | 3 | 17 | weekday |
Create two scatter plots for each year side-by-side within a same figure with the following axes.
fig = px.scatter(
df_july_fourth,
title='Trip Seconds vs Trip Miles with July 4th 5-6 PM Trips',
x='trip_seconds',
y='trip_miles',
size='trip_total',
facet_col='year',
color='shared_trip_authorized',
template='plotly_dark',
width=1200,
height=500,
)
fig.show()
trip_seconds and trip_miles.shared_trips_authorized == True) in 2020.Create a 3D scatter plot of trips in 2019.
⚠️ The 3D scatter plot is only used as a demonstration. You should avoid unnecessary 3D plots at all costs unless you're plotting spatial or scientific data.
fig = px.scatter_3d(
df_july_fourth[df_july_fourth['year'] == 2019],
title='Trip Seconds vs Trip Miles vs Trip Total with July 4th, 2019 5-6 PM Trips',
x='trip_seconds',
y='trip_miles',
z='trip_total',
color='shared_trip_authorized',
template='plotly_dark',
width=800,
height=800
)
fig.show()
Create a 3D scatter plot of trips in 2020.
fig = px.scatter_3d(
df_july_fourth[df_july_fourth['year'] == 2020],
title='Trip Seconds vs Trip Miles vs Trip Total with July 4th, 2020 5-6 PM Trips',
x='trip_seconds',
y='trip_miles',
z='trip_total',
color='shared_trip_authorized',
template='plotly_dark',
width=800,
height=800
)
fig.show()
trip_seconds, trip_miles, and trip_total.In this part, you will find the top 20 pickup areas and analyze the trips originating from those areas.
top_20_pickup_areas = df['pickup_area'].value_counts().head(20).index.tolist()
Filter only the rows where the pickup_area is in one of the top 20 pickup areas.
top_20_pickup_areas = df['pickup_area'].value_counts().head(20).index.tolist()
df_filtered = df[df['pickup_area'].isin(top_20_pickup_areas)]
display(df_filtered.head(3))
print(f'There are {df_filtered.shape[0]} rows and {df_filtered.shape[1]} columns in the filtered DataFrame')
| start | trip_seconds | trip_miles | pickup_area | dropoff_area | fare | tip | additional_charges | trip_total | shared_trip_authorized | trips_pooled | pickup_lat | pickup_lon | dropoff_lat | dropoff_lon | year | month | day | dayofweek | hour | weekday_weekend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2019-01-01 | 697 | 3.0 | Near North Side | Near West Side | 7.5 | 0.0 | 2.5 | 10.0 | False | 1 | 41.892073 | -87.628874 | 41.885300 | -87.642808 | 2019 | 1 | 1 | 1 | 0 | weekday |
| 2 | 2019-01-01 | 1598 | 4.7 | Lincoln Park | Loop | 10.0 | 2.0 | 2.5 | 14.5 | False | 1 | 41.922083 | -87.634156 | 41.870607 | -87.622173 | 2019 | 1 | 1 | 1 | 0 | weekday |
| 3 | 2019-01-01 | 573 | 0.9 | Near North Side | Near North Side | 5.0 | 0.0 | 2.5 | 7.5 | False | 1 | 41.892042 | -87.631864 | 41.892508 | -87.626215 | 2019 | 1 | 1 | 1 | 0 | weekday |
There are 1084442 rows and 21 columns in the filtered DataFrame
Calculate the following total number of trips and average trip total by pickup area
df_by_pickup_area = df_filtered.groupby(
'pickup_area', as_index=False
).agg({
'start': 'count',
'trip_total': 'mean',
}).rename(columns={
'start': 'num_trips'
})
display(df_by_pickup_area.head(5))
| pickup_area | num_trips | trip_total | |
|---|---|---|---|
| 0 | Austin | 16851 | 13.246273 |
| 1 | Avondale | 16425 | 12.623803 |
| 2 | Edgewater | 21689 | 13.867470 |
| 3 | Hyde Park | 21303 | 14.523999 |
| 4 | Irving Park | 14570 | 13.326660 |
Create a horizontal bar chart displaying the number of trips by each pickup area.
fig = px.bar(
df_by_pickup_area,
title='Number of Trips by Pickup Area',
x='num_trips',
y='pickup_area',
color='num_trips',
color_continuous_scale='emrld',
text='num_trips',
template='plotly_white',
height=800
)
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_yaxes(categoryorder='total ascending')
fig.show()
fig = px.treemap(
df_by_pickup_area,
title='Pickup Area Breakdown',
path=['pickup_area'],
values='num_trips',
height=600
)
fig.show()
In this part, you will add a new dimension (weekday/weekend) to the top 20 pickup areas.
Calculate the trip statistics by pickup area and weekday/weekend classification.
df_by_pickup_area_weekday_weekend = df_filtered.groupby(
['pickup_area', 'weekday_weekend'], as_index=False
).agg({
'start': 'count',
'trip_total': 'mean',
}).rename(columns={
'start': 'num_trips'
})
display(df_by_pickup_area_weekday_weekend.head(10))
| pickup_area | weekday_weekend | num_trips | trip_total | |
|---|---|---|---|---|
| 0 | Austin | weekday | 9064 | 13.213864 |
| 1 | Austin | weekend | 7787 | 13.283996 |
| 2 | Avondale | weekday | 7942 | 12.617546 |
| 3 | Avondale | weekend | 8483 | 12.629662 |
| 4 | Edgewater | weekday | 10745 | 14.131752 |
| 5 | Edgewater | weekend | 10944 | 13.607994 |
| 6 | Hyde Park | weekday | 11852 | 14.162735 |
| 7 | Hyde Park | weekend | 9451 | 14.977041 |
| 8 | Irving Park | weekday | 7307 | 13.409209 |
| 9 | Irving Park | weekend | 7263 | 13.243611 |
fig = px.sunburst(
df_by_pickup_area_weekday_weekend,
path=['pickup_area', 'weekday_weekend'],
values='num_trips',
title='Pickup Area Breakdown (Weekday/Weekend)',
width=800,
height=800
)
fig.show()
fig = px.treemap(
df_by_pickup_area_weekday_weekend,
path=['pickup_area', 'weekday_weekend'],
values='num_trips',
title='Pickup Area Breakdown (Weekday/Weekend)',
height=600
)
fig.show()